ABSTRACT

The present invention relates to a pattern recognition system which uses data fusion to combine data from a plurality of extracted features and a plurality of classifiers. Speaker patterns can be accurately verified with the combination of discriminant based and distortion based classifiers. A novel approach using a "leave one out" training set can be used for training the system with a reduced data set. Extracted features can be improved with a pole filtered method for reducing channel effects and an affine transformation for improving the correlation between training and testing data.

24 Claims, 16 Drawing Sheets

SPEAKER VERIFICATION SYSTEM USING
DECISION FUSION LOGIC

GOVERNMENT PATENT RIGHTS 

The U.S. Government has a paid-up license in this invention and the right in limited circumstances to require the patent owner to license others on reasonable terms as provided for by the terms of Grant No. P30602-91-C-0120 awarded by the U.S. Air Force.

BACKGROUND OF THE INVENTION

1. Field of the Invention 

The present invention relates to a pattern recognition system and, in particular, to a speaker verification system which uses data fusion to combine data from a plurality of extracted features and a plurality of classifiers for accurately verifying a claimed identity.

2. Description of the Related Art

Pattern recognition relates to identifying a pattern, such as speech, speaker or image. An identified speaker pattern can be used in a speaker identification system in order to determine which speaker is present from an utterance.

The objective of a speaker verification system is to verify a speaker's claimed identity from an utterance. Spoken input to the speaker verification system can be text dependent or text independent. Text dependent speaker verification systems identify the speaker after the utterance of a pre-determined phrase or a password. Text independent speaker verification systems identify the speaker regardless of the utterance. Conventional text independent systems are more convenient from a user standpoint in that there is no need for a password.

Feature extractions of speaker information have been performed with a modulation model using adaptive component weighting at each frame of speech, as described in the co-pending application entitled "Speaker Identification Verification System", U.S. Ser. No. 08/203,988, assigned to a common assignee of this disclosure and hereby incorporated by reference into this application. The adaptive component weighting method attenuates non-vocal tract components and normalizes speech components for improved speaker recognition over a channel.

Other conventional feature extraction methods include determining cepstral coefficients from the frequency spectrum or linear prediction derived spectral coding coefficients. Neural tree networks (NTN) have been used with speaker-independent data to determine discriminant based interspeaker parameters. The NTN is a hierarchical classifier that combines the properties of decision trees and neural networks, as described in A. Sankar and R. J. Mammone, "Growing and Pruning Neural Tree Networks", IEEE Transactions on Computers, C-42:221-229, March 1993. For speaker recognition, training data for the NTN consists of data for the desired speaker and data from other speakers. The NTN partitions feature space into regions that are assigned probabilities which reflect how likely a speaker is to have generated a feature vector that falls within the speaker's region. Text independent systems have the disadvantage of requiring a large magnitude of data for modeling and evaluating acoustic features of the speaker.

U.S. Pat. No. 4,957,961 describes a neural network which can be readily trained to reliably recognize connected words. A dynamic programming technique is used in which input neuron units of an input layer are grouped into a multilayer neural network. For recognition of an input pattern, vector components of each feature vector are supplied to respective input neuron units of one of the input layers that is selected from three consecutively numbered input layer frames. An intermediate layer connects the input neuron units of at least two input layer frames. An output neuron unit is connected to the intermediate layer. An adjusting unit is connected to the intermediate layer for adjusting the input-intermediate and intermediate-output connections to make the output unit produce an output signal. The neural network recognizes the input pattern as a predetermined pattern when the adjusting unit maximizes the output signal. About forty times of training are used in connection with each speech pattern to train the dynamic neural network.

It has been found that the amount of data needed for training and testing a verification system can be reduced by using text-dependent speaker utterances. One conventional text dependent speaker verification system uses dynamic time warping (DTW) for time aligning the diagnosis of features based on distortion, see S. Furui, "Cepstral Analysis Technique For Automatic Speaker Verification", IEEE Transactions on Acoustics, Speech, and Signal Processing, ASSP-29:254-272, April 1981. A reference template is generated from several utterances of a password during testing. A decision to accept or reject the speaker's claimed identity is made by whether or not the distortion of the speaker's utterance falls below a predetermined threshold. This system has the disadvantage of lacking accuracy.
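The distortion-threshold decision described above can be illustrated with a minimal sketch. It assumes a Euclidean local distance, the basic three-way step pattern and normalization by the summed sequence lengths; none of these choices is specified in the text, and the function names `dtw_distortion` and `verify` are hypothetical.

```python
import numpy as np

def dtw_distortion(reference, test):
    """Length-normalized DTW distortion between two feature sequences.

    reference has shape (N, d) and test has shape (M, d).
    Assumed here: Euclidean local distance and the basic step pattern
    (horizontal, vertical, diagonal), which the source does not specify.
    """
    reference = np.asarray(reference, dtype=float)
    test = np.asarray(test, dtype=float)
    n, m = len(reference), len(test)
    # local[i, j] is the distance between reference frame i and test frame j.
    local = np.linalg.norm(reference[:, None, :] - test[None, :, :], axis=2)
    # acc[i, j] is the cheapest accumulated cost of aligning the first i
    # reference frames with the first j test frames.
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best_previous = min(acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
            acc[i, j] = local[i - 1, j - 1] + best_previous
    return acc[n, m] / (n + m)

def verify(reference, test, threshold):
    """Accept the claimed identity when the distortion is below the threshold."""
    return dtw_distortion(reference, test) < threshold
```

A test utterance that repeats a frame of the reference still aligns with zero distortion, which is the time-alignment property DTW contributes.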

Another technique using hidden Markov models (HMM) has provided improved performance over DTW systems, as described in J. J. Naik, L. P. Netsch, and G. R. Doddington, "Speaker Verification Over Long Distance Telephone Lines", Proceedings ICASSP (1989). Several forms of HMM have been used in text dependent speaker verification. For example, subword models, as described in A. E. Rosenberg, C. H. Lee and F. K. Soong, "Subword Unit Talker Verification Using Hidden Markov Models", Proceedings ICASSP, pages 260-272 (1990) and whole word models, as described in A. E. Rosenberg, C. H. Lee and S. Gokcen, "Connected Word Talker Recognition Using Whole Word Hidden Markov Models", Proceedings ICASSP, pages 381-384 (1991), have been considered for speaker verification. HMM techniques have the limitation of generally requiring a large amount of data to sufficiently estimate the model parameters. One general disadvantage of DTW and HMM systems is that they only model the speaker and do not account for modeling data from other speakers using the systems. The lack of discriminant training makes it easier for an impostor to break into these systems.

It is desirable to provide a pattern recognition system in which a plurality of extracted features can be combined in a plurality of pre-determined classifiers for improving the accuracy of recognition of the pattern.

SUMMARY OF THE INVENTION

Briefly described, the present invention comprises a pattern recognition system which combines a plurality of extracted features in a plurality of classifiers including classifiers trained with different and overlapping subsets of the training data, for example, a "leave one out" technique, described below. Preferably, the pattern recognition system is used for speaker verification in which features are extracted from speech spoken by a speaker. A plurality of classifiers are used to classify the extracted features. The classified output is fused to recognize the similarities between the speech spoken by the speaker and speech stored
in advance for the speaker. From the fused classified output a decision is made as to whether to accept or reject the speaker. Most preferably, the speech is classified with the fusion of a dynamic time warping classifier for providing validation of the spoken password and a modified neural tree network classifier for providing discrimination from other speakers. The use of a discriminant trained classifier in a speaker verification system has the advantage of accurately discriminating one speaker from other speakers.

The system can also include a preliminary determination of whether or not to accept or reject the speaker based on performing word recognition of a word spoken by the speaker, i.e., the speaker's password. If the speaker's password is accepted, the classifiers are enabled. Preferably, the classifiers are trained by applying a plurality of utterances to the classifier with one of the utterances being left out. The left out utterance can be applied to the classifier to determine a probability between 0 and 1 for identifying the speaker. The probabilities can be compared against a classifier threshold to make a decision whether to accept or reject the speaker.

The text uttered by the speaker can be speaker dependent or speaker independent. The extracted features can also be segmented into subwords. Preferably, the subword is a phoneme. Each of the subwords can be modeled with at least one classifier. Output from the subword based classifiers can be fused for providing a subword based verification system.

Preferably, the features can be extracted with a pole filtering method for decreasing channel effects on the speech. In addition, the extracted features can be adjusted with an affine transformation for reducing the mismatch between training and testing environments.

The invention will be more fully described by reference to the following drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a schematic diagram of a speaker verification system in accordance with the teachings of the present invention.

FIG. 2A is a schematic diagram of the word recognition module shown in FIG. 1 during training of the system.

FIG. 2B is a schematic diagram of the word recognition module shown in FIG. 1 during testing of the system.

FIG. 3 is a schematic diagram of a speaker verification module combining a plurality of extracted features with a plurality of classifiers.

FIG. 4 is a schematic diagram of the combination of modified neural tree network and dynamic time warping classifiers used in the speaker verification module shown in FIG. 1.

FIG. 5 is a schematic diagram of a modified neural tree network (MNTN) classifier used in the speaker verification module shown in FIG. 1.

FIG. 6 is a schematic diagram of a dynamic time warping (DTW) classifier used in a speaker verification module shown in FIG. 1.

FIG. 7A is a schematic diagram of a plurality of utterances used in training of the speaker verification module.

FIG. 7B is a schematic diagram of the application of the plurality of utterances shown in FIG. 7A in the speaker verification module.

FIG. 8 is a graph of a speaker and other speaker scores.

FIG. 9 is a schematic diagram of a subword based speaker verification system.

FIG. 10A is a schematic diagram of a subword based classification system during training.

FIG. 10B is a schematic diagram of a subword based classification system during testing.

FIG. 11A is a schematic diagram of a prior art channel normalization system.

FIG. 11B is a schematic diagram of a channel normalization system of the present invention.

FIG. 12 is a graph of a pole filtering channel normalization.

FIG. 13A is a graph of a spectra of a frame of speech.

FIG. 13B is a graph of a spectra of a frame of speech for a normalization system of the present invention versus a frame from a prior art normalization system.

FIG. 14 is a schematic diagram of an affine transformation system.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT

During the course of this description like numbers will be used to identify like elements according to the different figures which illustrate the invention.

FIG. 1 illustrates a schematic diagram of an embodiment of a speaker verification system 10 in accordance with the teachings of the present invention. Speaker 11 utters speech 12. Speech 12 is applied as speech input signal 13 to feature extraction module 14. Feature extraction module 14 determines speech feature vectors 15 representative of characteristic parameters of speech input signal 13. Preferably, speech feature vectors 15 are determined with a linear prediction (LP) analysis to determine LP cepstral coefficients. The LP cepstral coefficients can be band pass liftered using a raised sine window with conventional techniques for providing improved recognition of the cepstral coefficients.

Alternatively, or in combination with the LP analysis, feature extraction module 14 can extract features with a plurality of methods. For example, an adaptive component weighting method as described in the above-identified U.S. Ser. No. 08/203,988 can be used to extract speech feature vectors 15. The adaptive component weighting technique enhances extracted features by applying weightings to predetermined components of the speech input signal 13 for producing a normalized spectrum which improves vocal tract features of the signal while reducing non-vocal tract effects. Feature extraction module 14 can also generate other linear prediction derived features from linear prediction (LP) coefficients using conventional methods such as log area ratios, line spectrum pairs and reflection coefficients. Feature extraction module 14 can also generate Fast Fourier transform (FFT) derived spectral features on linear and log frequency scales, fundamental frequency (pitch), loudness coefficient and zero crossing rates.

Word recognition module 20 receives speech feature vectors 15 and compares the speech feature vectors 15 with data 16 related to the speech feature vectors 15. Data 16 can be stored in database 50. For example, speaker 11 can utter a password as speech 12. Speech feature vectors 15 represent the utterance of the password for speaker 11. A closed set of passwords can be represented by data 16 and stored in database 50. The closed set of passwords corresponds to a set of speaker identities, including the password for speaker 11. At word recognition module 20, if the received speech feature vectors 15 match data 16 stored in database 50, for example, a match of a password for a claimed identity, speaker verification module 30 is enabled. If the received speech feature vectors 15 do not match data 16 stored in database 50, for example, no match
of a password is stored in database 50 for the claimed identity, user 11 can be prompted to call again in module 21.

Speaker verification module 30 preferably uses data fusion to combine a plurality of classifiers with speech feature vectors 15, which technique is described in detail below. Fused classifier outputs 35 of speaker verification module 30 are received at decision fusion logic module 40. Decision fusion logic module 40 provides the final decision on whether to accept or reject the claimed identity of speaker 11, thereby verifying the speaker's claimed identity.

FIGS. 2A and 2B illustrate word recognition module 20 during enrollment of speaker 11 and testing of speaker 11, respectively. During enrollment of speaker 11 in speaker verification system 10, training speech 22 is uttered by speaker 11. For example, training speech 22 can comprise four repetitions of a password for speaker 11. Each of the repetitions is recognized with word matching recognition module 28. Preferably, a DTW-based template matching algorithm is used in word matching recognition module 28 to produce recognized words 23. Recognized words 23 are clustered into a speaker dependent template 24. Speaker independent templates 26 can also be generated with recognized words 23 and data of repetitions of the same training speech 22 spoken by other speakers 25 using speaker verification system 10. A majority vote on recognized words 23 from word recognition matching module 28 can be used to identify a user's password 27 for speaker 11.

During testing of speaker 11, speech 12 is spoken by user 11 and is compared against speaker dependent template 24 and speaker independent template 26 in word recognition matching module 28. If speech 12 represents password 27 of speaker 11 and matches either the speaker dependent word template 24 or speaker independent word template 26, an "accept" response is outputted to line 29. If speech 12 does not match either the speaker dependent word template 24 or the speaker independent word template 26, a "reject" response is outputted to line 29.

Preferably, speaker verification module 30 uses data fusion to combine a plurality of extracted features 60, 61 and 62 with a plurality of classifiers 70, 71 and 72, as shown in FIG. 3. Features 60, 61 and 62 can represent speech feature vectors 15 extracted with varying predetermined extraction methods as described above. Classifiers 70, 71 and 72 can represent varying predetermined classification methods such as, for example, a neural tree network (NTN), multilayer perceptron (MLP), hidden Markov models (HMM), dynamic time warping (DTW), Gaussian mixtures model (GMM) and vector quantization (VQ). In an alternate embodiment, features 60, 61 and 62 can represent extraction features of an alternative pattern such as speech or image and classifiers 70, 71 and 72 can represent predetermined classification methods for the speech or image patterns. Output 73, 74 and 75 from respective classifiers 70, 71 and 72 can be combined in decision fusion logic module 40 to make a final decision on whether to accept or reject speaker 11. Decision fusion module 40 can use conventional techniques, like linear opinion pool, log opinion pool, Bayesian combination rules, voting method or an additional classifier to combine classifiers 70, 71 and 72. It will be appreciated that any number of features or classifiers can be combined. The classifiers can also include classifiers trained with different and overlapping subsets of training data, for example, the leave one out technique described below.

FIG. 4 illustrates a preferred speaker verification module 30 for use in the speaker verification system of the present invention. Speech feature vectors 102 are inputted to Neural Tree Network (NTN) classifiers 104, 106, 108 and 110 and Dynamic Time Warping (DTW) classifiers 120, 122, 124 and 126. During classification, each NTN classifier 104, 106, 108 and 110 determines if feature vector 102 is above a predetermined respective threshold, T_NTN, of NTN stored in database 132. Each DTW classifier 120, 122, 124 and 126 determines if feature vector 102 is above a predetermined respective threshold, T_DTW, of DTW stored in database 132. If feature vectors 102 are above respective thresholds T_NTN and T_DTW, a binary output of "1" is outputted to line 240 and line 241, respectively. If feature vectors 102 are less than the predetermined respective thresholds T_NTN and T_DTW, a binary output of "0" is outputted to line 240 and line 241, respectively.

During testing of speaker 11 with speaker verification system 10, decision module 40 receives the binary outputs from lines 240 and 241. In a preferred embodiment of decision module 40, a majority vote can be taken on the binary outputs in decision module 40 to determine whether to accept or reject speaker 11. In this embodiment, if the majority of the binary outputs are "1", the speaker is accepted and if the majority of the binary outputs are "0", the speaker is rejected.

A preferred classifier designated as a modified neural tree network (MNTN) 200 can be used as a discriminant based classifier in speaker verification module 30. MNTN 200 has a plurality of interconnected nodes 202, 204 and 206, as shown in FIG. 5. Node 204 is coupled to leaf node 208 and leaf node 210 and node 206 is coupled to leaf node 212 and leaf node 214. A probability measurement is used at each of leaf nodes 208, 210, 212 and 214 because of "forward pruning" of the tree by truncating the growth of MNTN 200 beyond a predetermined level.

MNTN 200 is trained for speaker 11 by applying data 201 from other speakers 25 using speaker verification system 10. Extracted feature vectors 15 for speaker 11, identified as "S_i", are assigned labels of "1" and extracted feature vectors for other speakers 25 using speaker verification system 10 are assigned labels of "0". The labeled feature vectors are applied respectively to leaf nodes 208, 210, 212 and 214. A vote is taken at each of leaf nodes 208, 210, 212 and 214. Each of leaf nodes 208, 210, 212 and 214 is assigned the label of the majority of the vote. A "confidence" is defined as the ratio of the number of labels for the majority to the total number of labels. For example, leaf node data which comprises eight "0" features is assigned a label of "0" and a confidence of "1.0". Data 230 which comprises six "1" features and four "0" features is assigned a label of "1" and a confidence of "0.6".

A trained MNTN 200 can be used in speaker verification module 30 to determine a corresponding speaker score from a sequence of feature vectors "X" from speech 12. The corresponding speaker score P_MNTN(X/S_i) can be determined with the following equation:

$$P_{MNTN}(X/S_i) = \frac{\sum_{j=1}^{M} c_j^1}{\sum_{j=1}^{M} c_j^1 + \sum_{j=1}^{N} c_j^0}$$

where speaker 11 is identified as S_i, c^1 is the confidence score for speaker 11, and c^0 is the confidence score for all other speakers. M and N correspond to the numbers of vectors classified as "1" and "0", respectively.

A preferred DTW classifier uses a distortion based approach for time aligning two waveforms or two feature
patterns, as shown in FIG. 6. The waveforms are represented by a reference pattern of speech feature vectors 15 on the X axis and a test pattern of speech feature vectors 15 on the Y axis, wherein N represents the number of reference patterns and M represents the number of test patterns. Global constraints 270, 271, 272 and 273 represent limits for the dynamic time warping path 275. Dynamic time warping path 275 can be determined by conventional methods such as described in H. Sakoe and S. Chiba, "Dynamic programming algorithm optimization for spoken word recognition", IEEE Trans. on Acoustics, Speech and Signal Processing, vol. ASSP-26, no. 1, pp. 43-49, February 1978.

It is preferable to combine a classifier which is based on a distortion method, i.e., a DTW classifier, to provide information related to the speaker and a classifier based on a discriminant method, i.e., NTN or MNTN classifiers, to provide information related to the speaker with respect to other speakers using the speaker verification system 10. The fusion of a DTW classifier and a MNTN or NTN classifier also has the advantage that the DTW classifier provides temporal information which is not generally part of the NTN or MNTN classifiers.

NTN classifiers 104, 106, 108 and 110 and DTW classifiers 120, 122, 124 and 126 can be trained with training module 300, shown in FIGS. 7A and 7B. Training module 300 can also be used for training MNTN classifiers, DTW classifiers and other classifiers which can be used in speaker verification module 30. A resampling technique identified as a "leave one out" technique is preferably used in training module 300. A predetermined number of utterances of training speech are received from speaker 11. In this embodiment, four utterances, defined as 302, 304, 306 and 308 of speech 22, such as the speaker's password, are used. A combination of three of the four utterances, with one utterance being left out, is applied to pairs of NTN classifiers 104, 106, 108 and 110 and DTW classifiers 120, 122, 124 and 126. The three utterances are used for training the classifiers and the remaining utterance is used as an independent test case. For example, utterances 302, 304 and 306 can be applied to NTN classifier 104 and DTW classifier 120; utterances 304, 306 and 308 can be applied to NTN classifier 106 and DTW classifier 122; utterances 302, 306 and 308 can be applied to NTN classifier 108 and DTW classifier 124; and utterances 302, 304 and 308 can be applied to NTN classifier 110 and DTW classifier 126.

After application of the respective three utterances to each pair of NTN classifiers 104, 106, 108 and 110 and DTW classifiers 120, 122, 124 and 126, the left out utterance is applied to each respective pair of NTN classifiers 104, 106, 108 and 110 and DTW classifiers 120, 122, 124 and 126, as shown in FIG. 7C. For example, utterance 308 is applied to NTN classifier 104 and DTW classifier 120, utterance 302 is applied to NTN 106 and DTW 122, utterance 304 is applied to NTN 108 and DTW 124 and utterance 306 is applied to NTN 110 and DTW 126. A probability, P, between 0 and 1, designated as 310, 312, 314 and 316, is calculated. Probabilities 310, 312, 314 and 316 and probabilities 317, 318, 319 and 320 are compared against respective thresholds T_DTW and T_NTN in vote module 321 of decision fusion logic module 40.

FIG. 8 is a graph of intraspeaker scores from other speakers 25 and interspeaker scores from speaker 11 which can be used to determine thresholds for the classifiers used in speaker verification system 10, for example, thresholds T_DTW and T_NTN. The interspeaker scores of speaker 11 for speech 12 are represented by graph 350 having mean speaker score 351. Intraspeaker scores of other speakers 25 for speech 12 are represented by graph 360 having mean speaker score 361. Thresholds, T, can be determined from the following equation:

$$T = x \cdot \text{interspeaker} + y \cdot \text{intraspeaker}$$

A soft score, S, can be determined by the amount that speech 12 is greater than or less than threshold, T. A score of each classifier, C, is between zero and one with zero being the most confident reject and one being the most confident accept. The accept confidence, C_accept, which applies between the threshold T and one, can be defined from the following equation:

$$C_{accept} = \frac{S - T}{1 - T}$$

A reject confidence, C_reject, which applies between 0 and threshold T, can be defined as:

$$C_{reject} = \frac{T - S}{T}$$

FIG. 9 illustrates a schematic diagram of a subword based speaker verification system 400. After extraction of speech feature vectors 15 in feature extraction module 14, speech feature vectors 15 are segmented into subwords 404 in subword segmentation module 402. Preferably, subwords 404 are phonemes. Subwords 404 can be applied to train speaker module 406 and test speaker module 408.

FIG. 10A is a schematic diagram of the subword based speaker verification system 400 during application of the train speaker module 406. Speaker extraction features 15 depicting speaker 11 training utterances and a password transcript 410 are applied to subword phoneme level segmentation module 402. Password transcript 410 can be spoken by speaker 11, inputted by a computer or scanned from a card, or the like. Speech segmentation module 402 segments speaker extraction features 15 into subwords 1 to M, for example, subword "1" in module 420, subword "m" in module 422 and subword "M" in module 424, in which M is the number of segmented subwords. Subwords 420, 422 and 424 can be stored in subword database 425. Supervised learning vector labeling scheme 430 determines the labels for training speech vectors as "0" or "1" for training classifiers 440, 442 and 444. For example, all subwords for other speakers 25 can be labelled as "0" and subwords for speaker 11 can be labelled as "1". Alternatively, the closest phonemes can be searched in database 425. Subword classifiers 440, 442 and 444 are applied to respective subwords 420, 422 and 424 for classifying each of the subwords. Preferably, subword classifiers 440, 442 and 444 use NTN and MNTN classification methods.

FIG. 10B is a schematic diagram of the subword based speaker verification system 400 during application of the test speaker module 408. Speaker extraction features 15 depicting speaker 11 test utterances are applied to subword phoneme level segmentation module 402 with password transcript 410. Subword classifiers 440, 442 and 444 classify respective subwords 420, 422 and 424 determined from extracted speaker features 15 depicting speaker 11 test utterances. Output 445 from classifiers 440, 442 and 444 is applied to decision fusion logic module 40 for determining whether to accept or reject speaker 11 based on fused output from classifiers 440, 442 and 444 and on a calculated accept confidence, C_accept, as described above.

A preferred method which can be described as "pole filtering" can be used in feature extraction module 14 for yielding speech feature vectors 15 which are robust to channel differences. Pole filtering performs channel normalization using intelligent filtering of the all pole linear prediction (LP) filter.

Clean speech is convolved with a channel with impulse response h; then a channel cepstrum of the ordinary cepstral mean can be represented by

$$c_s = h + s_s$$

where s_s corresponds to the cepstral mean component solely due to underlying clean speech. The component due to clean speech should be zero-mean in order for the channel cepstrum estimate c_s to correspond to the cepstral estimate, h, of the actual underlying convolution distortion.

It can be empirically shown that the mean cepstrum component due to clean speech is never zero for short utterances and can be the case for training and testing of speaker verification system 10.

A prior art channel normalization system 500 is shown in FIG. 11A in which speech is applied to intraframe weighting module 502. Adaptive component weighting (ACW) is an example of an intraframe weighting for channel normalization. Weighted speech 504 is received at interframe processing module 506 for removing additional channel effects. One conventional interframe method for removing channel effects is by cepstral mean subtraction (CMS). Since the channel cepstrum comprises a gross spectral distribution due to channel as well as speech, the conventional elimination of a distorted estimate of the channel cepstrum from the cepstrum of each speech frame corresponds to effectively deconvolving an unreliable estimate of the channel.
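Conventional cepstral mean subtraction, and the bias it carries, can be sketched as follows. This is an illustration under stated assumptions rather than the patent's implementation: cepstra are held as a frames-by-coefficients array, the channel is a constant additive vector in the cepstral domain, and the function name `cepstral_mean_subtraction` is hypothetical.

```python
import numpy as np

def cepstral_mean_subtraction(cepstra):
    """Subtract the per-coefficient mean over all frames of one utterance.

    cepstra has shape (num_frames, num_coefficients).
    Returns the normalized cepstra and the mean that was removed, which
    serves as the (possibly distorted) channel estimate.
    """
    cepstra = np.asarray(cepstra, dtype=float)
    channel_estimate = cepstra.mean(axis=0)
    return cepstra - channel_estimate, channel_estimate

# Synthetic example: a constant channel cepstrum added to every speech frame.
rng = np.random.default_rng(0)
clean = rng.normal(loc=0.3, scale=1.0, size=(20, 4))  # short utterance, non-zero mean
channel = np.array([0.5, -0.2, 0.1, 0.0])
observed = clean + channel

normalized, estimate = cepstral_mean_subtraction(observed)
# The estimate equals the channel plus the mean of the clean speech cepstra,
# so the speech mean is the error of the channel estimate.
estimate_error = estimate - channel
```

In this sketch `estimate_error` equals `clean.mean(axis=0)`, which is zero only if the clean speech cepstra happen to be zero-mean over the utterance.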

FIG. 11B illustrates a channel normalization system 600 of the present invention. Speech 12 is applied to channel estimate pole filtering module 602. Pole filtering de-emphasizes the contribution of the invariant component due to speech, s_s. The refined channel estimate is used to normalize the channel. Preferably, the refining of the channel cepstrum can be performed in an iterative manner.

The estimate of the channel cepstrum, c_s, depends upon the number of speech frames available in the utterance. In the case where the speech utterance available is sufficiently long, it is possible to get an estimate of the channel cepstrum that approximates the true channel estimate, h. In most practical situations, the utterance durations for training or testing are never long enough to allow for s_s → 0. The cepstral mean estimate can be improved by determining the dominance of the poles in the speech frame and their contribution to the estimate of the channel cepstrum.

The effect of each mode of the vocal tract on the cepstral mean is determined by converting the cepstral mean into linear prediction coefficients and studying the dominance of corresponding complex conjugate pole pairs. A spectral component, for a frame of speech, is most dominant if it corresponds to a complex conjugate pole pair closest to the unit circle (minimum bandwidth) and least dominant if it corresponds to a complex conjugate pole pair furthest from the unit circle (maximum bandwidth).
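The link between pole radius and bandwidth can be made concrete with a short sketch. It assumes the LP polynomial is given as A(z) = 1 + a1 z^-1 + ... + ap z^-p and uses the common approximation B = -(fs/π) ln r for the bandwidth of a pole of radius r at sampling rate fs; that formula, the sampling rate and the function name `pole_frequencies_and_bandwidths` are assumptions, not taken from the text.

```python
import numpy as np

def pole_frequencies_and_bandwidths(lp_coeffs, sample_rate):
    """Return pole-pair frequencies and bandwidths in Hz, most dominant first.

    lp_coeffs is [1, a1, ..., ap] for A(z) = 1 + a1 z^-1 + ... + ap z^-p.
    Real poles are skipped; one pole is kept per complex conjugate pair.
    """
    poles = np.roots(lp_coeffs)
    poles = poles[np.imag(poles) > 0]
    radii = np.abs(poles)
    freqs = np.angle(poles) * sample_rate / (2 * np.pi)
    # A pole closer to the unit circle (radius near 1) has a narrower bandwidth.
    bandwidths = -np.log(radii) * sample_rate / np.pi
    order = np.argsort(bandwidths)
    return freqs[order], bandwidths[order]
```

Sorting by bandwidth puts the pole pair closest to the unit circle first, which matches the ordering of dominance described above.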

Constraining the poles of speech in order to acquire a smoother and hence a more accurate inverse channel estimate in the cepstral domain corresponds to a modified cepstral mean that de-emphasizes the cepstral bias related to the invariant component due to the speech. The refined cepstral mean removal, devoid of the gross spectral distribution component due to speech, offers an improved channel normalization scheme.

The channel estimate best determined from channel poles
filtering module 602 is combined with speech 12 in
deconvolution module 730 for deconvolution in the time domain
to provide normalized speech 735. Conventional interframe
coupling 502 and interference processing 506 can be applied
to normalized speech 735 to provide channel normalized
speech feature vector 740. Speech feature vector 740 can be
applied in a similar manner as speech feature vectors 15
shown in FIG. 1.

One preferred method for improving the
estimate of the channel uses pole filtered cepstral
coefficients, PFCC, wherein the narrow band poles are
inflated in their bandwidths while their frequencies are left
unchanged, as shown in FIG. 12. Poles 801, 802, 803, 804,
805 and 806 are moved to modified poles 811, 812, 813, 814,
815 and 816. The effect is equivalent to moving the narrow
band poles inside the unit circle along the same radius, thus
keeping the frequency constant while broadening the
bandwidths.

Pole filtered cepstral coefficients, PFCC, are determined
for speech concurrently with speech feature vectors 15. Pole
filtered cepstral coefficients, PFCC, are determined by
analyzing if a pole in a frame of speech 12 has a bandwidth
less than a predetermined threshold, t. If the bandwidth of
the pole is less than the predetermined threshold, the
bandwidth of that pole is clipped to threshold t. The pole
filtered cepstral coefficients can be used to evaluate the
modified cepstral means. An improved inverse filter estimate
is obtained by using means of Pole Filtered Cepstral
Coefficients, PFCCs, which better approximate the true
inverse channel filter. Subtracting the modified cepstral mean
from cepstral frames of speech preserves the spectral
information while more accurately compensating for the
spectral tilt of the channel.
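
The pole filtering steps above can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the patented implementation: the poles of each frame are assumed to be given, the threshold is expressed in Hz, and the cepstrum of an all-pole filter 1 / prod(1 - p z^-1) is computed from its poles with the identity c_n = (1/n) sum(p^n). All function names and numeric values are hypothetical.

```python
import cmath
import math


def clip_pole_bandwidth(pole, sample_rate, min_bandwidth_hz):
    """Widen a pole's bandwidth to min_bandwidth_hz if narrower, keeping its angle."""
    # Bandwidth B corresponds to radius exp(-pi * B / fs); a narrower
    # bandwidth means a larger radius, so clip the radius from above.
    max_radius = math.exp(-math.pi * min_bandwidth_hz / sample_rate)
    if abs(pole) > max_radius:
        return cmath.rect(max_radius, cmath.phase(pole))
    return pole


def poles_to_cepstrum(poles, n_coeffs):
    """Cepstral coefficients c_1..c_n of an all-pole filter from its poles."""
    return [sum(p ** n for p in poles).real / n for n in range(1, n_coeffs + 1)]


def pole_filtered_cepstral_mean(frames_poles, sample_rate, min_bandwidth_hz, n_coeffs):
    """Mean over frames of the cepstra computed from bandwidth-clipped poles."""
    pfcc = [
        poles_to_cepstrum(
            [clip_pole_bandwidth(p, sample_rate, min_bandwidth_hz) for p in poles],
            n_coeffs,
        )
        for poles in frames_poles
    ]
    return [sum(c[k] for c in pfcc) / len(pfcc) for k in range(n_coeffs)]


fs = 8000.0  # assumed sample rate in Hz
frame_poles = [
    [cmath.rect(0.99, 0.4), cmath.rect(0.99, -0.4)],  # narrow-band pair, gets clipped
    [cmath.rect(0.80, 1.0), cmath.rect(0.80, -1.0)],  # wide-band pair, unchanged
]
mean_pf = pole_filtered_cepstral_mean(frame_poles, fs, 200.0, 4)

# Subtract the modified mean from the ordinary (unfiltered) cepstral frames.
cepstra = [poles_to_cepstrum(p, 4) for p in frame_poles]
normalized = [[c[k] - mean_pf[k] for k in range(4)] for c in cepstra]
```

Only the mean is computed from the modified poles; the frames themselves keep their original cepstra, so the spectral detail of the speech is retained.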

FIG. 13A illustrates a sample spectra 700 of a frame of
speech. FIG. 13B illustrates spectra 710 of a prior art
cepstral mean $c_s$ subtracted from spectra 700. Spectra 720
is a pole filtered modified cepstral mean $c_s^{pf}$ subtracted from
spectra 700. Spectra 720 shows improved spectral
information over spectra 710.

FIG. 14 illustrates affine transformation system 900 which
can be used with training and testing of speaker verification
system 10. The mismatch between the training and testing
environments can be reduced by performing an affine
transformation on the cepstral coefficients extracted with feature
extraction module 14. An affine transform y of vector x is
defined as

$$y = Ax + b$$

where A is a matrix representing a linear transformation and
b a non-zero vector representing the translation, y is the
testing data and x corresponds to the training data. In the
speech processing domain, the matrix A models the
shrinkage of individual cepstral coefficients due to noise and the
vector b accounts for the displacement of the cepstral mean
due to the channel effects.
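
A small numeric sketch of the affine model y = Ax + b follows. It assumes, purely for illustration, a diagonal A (independent shrinkage of each cepstral coefficient); the text does not restrict A to be diagonal. The function names and values are made up.

```python
def affine_transform(A, b, x):
    """Return y = A x + b for a square matrix A (list of rows) and vectors b, x."""
    return [
        sum(a_ij * x_j for a_ij, x_j in zip(row, x)) + b_i
        for row, b_i in zip(A, b)
    ]


def invert_diagonal_affine(A, b, y):
    """Recover x from y = A x + b when A is diagonal with non-zero entries."""
    return [(y_i - b_i) / A[i][i] for i, (y_i, b_i) in enumerate(zip(y, b))]


A = [[0.5, 0.0], [0.0, 0.25]]  # assumed shrinkage of each coefficient
b = [1.0, -2.0]                # assumed displacement of the cepstral mean
x = [4.0, 8.0]                 # training-condition cepstral vector

y = affine_transform(A, b, x)              # testing-condition vector
x_back = invert_diagonal_affine(A, b, y)   # mapped back to training condition
```

Inverting the transform maps a testing vector back toward the training condition, which is how such a model can reduce the mismatch between the two environments.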

Singular value decomposition (SVD) describes the
geometry of the affine transform with the following equation:

$$y = U \Sigma V^T x + b$$

where U and $V^T$ are unitary matrices, and $\Sigma$ is diagonal. The
geometric interpretation is that x is rotated by $V^T$, rescaled
by $\Sigma$, and rotated again by U. There is also a translation
introduced by the vector b.
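
The rotate, rescale, rotate, translate interpretation can be checked numerically. The sketch below builds a 2 x 2 matrix A from two real rotation matrices (a special case of unitary matrices) and a diagonal scaling, then compares the step-by-step result with the direct product. The angles, scale factors, and helper names are arbitrary assumptions for the example.

```python
import math


def rotation(angle):
    """2 x 2 rotation matrix for the given angle in radians."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c, -s], [s, c]]


def matvec(M, v):
    """Matrix-vector product."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def matmul(M, N):
    """Matrix-matrix product."""
    cols = list(zip(*N))
    return [[sum(m * n for m, n in zip(row, col)) for col in cols] for row in M]


U = rotation(0.3)
Vt = rotation(-1.1)
S = [[0.9, 0.0], [0.0, 0.4]]  # diagonal rescaling
b = [0.2, -0.1]               # translation
A = matmul(U, matmul(S, Vt))  # A = U S V^T

x = [1.0, 2.0]
step1 = matvec(Vt, x)      # rotate by V^T
step2 = matvec(S, step1)   # rescale by the diagonal matrix
step3 = matvec(U, step2)   # rotate again by U
y_steps = [s + t for s, t in zip(step3, b)]           # translate by b
y_direct = [s + t for s, t in zip(matvec(A, x), b)]   # y = A x + b
```

Both routes give the same y up to floating-point rounding, and the first rotation leaves the length of x unchanged.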

assigning a second label to said at least one feature from
said second speech spoken by other speakers; and

training said classifiers on said first and second labels.

9. The method of claim 1 wherein said recognizing step
comprises:

applying to a pair of said plurality of classifiers, a plurality
of first utterances of speech for said speaker and
leaving out one of said utterances defined as a left out
utterances for training said classifiers;

applying said left out utterances to said pair of classifiers
for independently testing said classifiers;

calculating a first probability for a first one of said
classifiers in said pair of classifiers and a second
probability for a second one of said classifiers in said
pair of classifiers; and

determining a first threshold for said first one of said
classifiers in said pair of classifiers from said first
probability and a second threshold for said second one
of said classifiers in said pair of classifiers from said
second probability,

wherein said similarity of said plurality of classified
output is determined by comparing said first one of said
classifiers in said pair with said first threshold and said
second one of said classifiers in said pair with said
second threshold.

10. The method of claim 1 wherein said extracting step is 
performed by modifying poles in a pole filter of said first and 
second speech to extract said at least one feature. 

11. The method of claim 10 wherein said poles are
modified by the steps of:

determining a spectral component of said at least one
feature; and

constraining the narrow bandwidth to obtain a channel
estimate.

12. The method of claim 11 further comprising the steps
of:

deconvoluting said first speech and said second speech
with said channel estimate to obtain normalized speech;
and

computing spectral features of said normalized speech to
obtain normalized speech feature vectors which are
applied to said classifying step.

13. The method of claim 11 further comprising the steps
of:

converting said channel estimate to cepstral coefficients to
obtain a modified channel estimate in a cepstral
domain; and

subtracting said modified channel estimate from cepstral
frames of said first speech and said second
speech.

14. The method of claim 1 further comprising the step of:

segmenting said at least one feature from said first speech
into a plurality of first subwords after said extracting
step.

15. The method of claim 14 wherein said subwords are
phonemes.

16. The method of claim 14 further comprising the steps
of:

extracting at least one feature from second speech spoken
by other speakers;

segmenting said at least one feature from said second
speech into a plurality of second subwords after said
extracting step;

storing said first plurality of subwords and said second
plurality of subwords in a subword database;

determining from said stored first subwords first labels for
said speaker and from said second subwords second
labels for other speakers; and

training said classifiers on said first and second labels.

17. The method of claim 1 wherein said at least one
feature is corrected using an affine map transformation,
wherein said affine transformation is represented by

$$y = Ax + b$$

wherein y is said affine transform of vector x, A is a matrix
representing a linear transformation and vector b represents
the translation.

18. The method of claim 17 wherein said at least one
feature are cepstral coefficients and said cepstral coefficients
are corrected using an affine map transformation.

19. A system for speaker verification of a speaker
comprising:

means for extracting at least one feature from first speech 
spoken by said speaker; 

means for classifying said at least one feature with a 
plurality of classifiers for forming a plurality of clas- 
sified output; 

means for fusing said plurality of classified output for 
forming fused classifier outputs; 

means for recognizing said fused classifier outputs by 
determining the similarity of said fused classifier out- 
puts and second speech spoken by said speaker before 
said speaker verification; and 

means for determining from said recognized fused clas- 
sifier outputs whether to accept or reject said speaker. 

20. The system of claim 19 further comprising:

means for performing word recognition on said first
speech spoken by said speaker by comparing said at
least one feature against data for said speaker stored
before said speaker verification for determining
whether to preliminarily accept or preliminarily reject
said speaker; and

means for enabling said means for classifying said at least
one feature if it is determined to preliminarily accept
said speaker or enabling a call back module if it is
determined to preliminarily reject said speaker.

21. The system of claim 20 wherein said data comprises
a speaker dependent template formed from first speech
spoken by said speaker in advance and a speaker
independent template formed of first speech spoken by at least one
second speaker in advance.

22. The system of claim 21 wherein said means for 
classifying comprises a modified neural tree network 
(MNTN) and a dynamic time warping classifier. 

23. The system of claim 22 wherein said means for
extracting is performed by constraining poles in an all pole
filter.

24. The system of claim 23 wherein said at least one
feature is a cepstral coefficient and said cepstral coefficient
is corrected using an affine transformation.